(57) A self-organizing algorithm for the massive
analysis of gene expression data from DNA array
experiments, beginning from a structure composed of two
"daughter" neurons connected to a "mother" neuron,
each of which is a list of values or profile of
initially random data. The algorithm divides the data set
into data sub-sets in a successive series of cycles. Each
of the cycles has a series of stages in which the data is
introduced to the terminal neurons and the latter are
updated. When the updating produced between two
consecutive stages is below a certain pre-established
minimum level, it is considered that the network has
converged in that cycle and it is decided if the network
should continue to grow or if it has already stopped
growing, in which case the algorithm process is
finalized.
Description

OBJECT OF THE INVENTION 

[0001] The present invention refers to a self-organizing
algorithm for gene expression data, on the basis of
which it is possible to implement a process for the
massive analysis of gene expression data from DNA array
experiments, providing essential novelty features and
significant advantages with respect to the processes
known in the current State of the Art and intended for
these same purposes.

[0002] More specifically, the invention proposes the 
development of a process beginning from a hierarchical 
structure composed of two "daughter" neurons connect- 
ed to a "mother" neuron, which are composed of a list 
of values or profile, initially of random data. In a series 
of successive cycles, the algorithm divides the data set 
into sub-sets. Each of the cycles has a series of stages 
in which the sub-set data is introduced to the terminal 
neurons and the latter are updated. When the updating 
produced between two consecutive stages is below a 
certain pre-established minimum level, it is considered 
that the network has converged in that cycle and it is 
decided if the network should continue to grow, adding 
neurons in lower levels, or if it has stopped growing, in 
which case the algorithm process finalizes. 
[0003] The field of application of the present invention
lies within the biomedicine or bioscience industry,
as well as within the food and agriculture sectors.

BACKGROUND AND SUMMARY OF THE INVENTION 

[0004] Currently, one of the largest problems associ- 
ated with the use of DNA array techniques is the enor- 
mous volume of generated data. 
[0005] The development of a SOTA (Self Organizing 
Tree Algorithm), (Dopazo and Carazo, 1997) neuronal 
network is known, originally designed for sorting se- 
quences. 

[0006] The process object of the present invention is 
based on said neuronal network but adapted to the mas- 
sive analysis of gene expression data. 
[0007] Specifically, the process is capable of finding
clusters in which the gene expression is similar and
separating them from those showing different expression
patterns. It functions divisively, meaning that it
divides the data set into sub-sets until it defines the
clusters with a similar expression. In addition, it does so
hierarchically; in other words, the clusters are
defined as a hierarchy of likenesses in which the most
similar clusters are grouped into larger clusters, and so
on, until the complete data set is described as a
binary tree (or with multiple branches if necessary).
[0008] In this way, the self-organizing algorithm object
of the invention makes it possible to compare the
expression levels of different genes under different
conditions such as temperature, the dose of a compound,
patient tissue, etc.
[0009] Each DNA array, representing a condition,
contains measures corresponding to the expression
level of a series of genes under study (see Eisen et al.,
1998 for details of the experiment). In the self-organizing
algorithm, the gene expression profile or pattern is the list
of expression values obtained in the different conditions
(in other words, from each array). To facilitate the
comparison between profiles, some mathematical
transformation (logarithm and normalization or the like) is
usually performed. Said comparison between profiles is
done by means of a distance function. The most widely
used are Euclidean distances (the square root of the sum of
the squared item-to-item differences between the profile
conditions) or correlations (measuring the likeness of
the profile tendencies). The distance function is simply
an objective way to describe the magnitude of the
difference between two expression profiles.
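
As an illustration of the two kinds of distance function mentioned above, the following minimal Python sketch compares two expression profiles. It assumes the standard definition of the Euclidean distance (square root of the sum of squared differences) and, for the correlation case, assumes that the distance is taken as one minus the Pearson correlation coefficient; the function names and the sample profiles are hypothetical and are not part of the invention as described.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def correlation_distance(p, q):
    """One minus the Pearson correlation (assumed convention).

    Close to 0 when the two profiles follow the same tendency,
    close to 2 when they follow opposite tendencies.
    Undefined (division by zero) if either profile is constant.
    """
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return 1.0 - cov / math.sqrt(var_p * var_q)


# Hypothetical profiles: the second has the same tendency at twice the level.
profile_a = [1.0, 2.0, 3.0]
profile_b = [2.0, 4.0, 6.0]
print(euclidean_distance(profile_a, profile_b))
print(correlation_distance(profile_a, profile_b))
```

The example is chosen so that the two measures disagree: the profiles differ in magnitude, so their Euclidean distance is non-zero, while their tendencies coincide, so the correlation-based distance is essentially zero.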
[0010] More specifically, the algorithm begins from a
structure composed of two "daughter" neurons connected
to a "mother" neuron. Each of said neurons is a list
of values or profile. Initially, the neurons are a collection
of random numbers. The algorithm divides the data set
into sub-sets in a successive series of cycles. Each of
the cycles has a series of stages in which the data is
introduced to the terminal neurons and the latter are
updated. A stage consists of introducing the entire data set
to the terminal neurons. As the stages elapse, the
neurons become updated, being set to mean values
obtained from the data set. When the updating produced
between two consecutive stages is below a certain
pre-established minimum level, it is considered that the
network has converged in that cycle and it is decided if the
network should continue to grow or if it has stopped
growing, in which case the algorithm process finalizes.


BRIEF DESCRIPTION OF THE DRAWINGS 

[0011] Other features and advantages of the invention
will be clearly shown based on the following description
of an embodiment example, carried out in a non-limiting
and illustrative manner, making reference to the
attached drawings in which:

Figure 1 shows the flow chart of the self-organizing
algorithm object of the invention.

Figure 2 schematically shows how the likeness of the
neuron values is adapted and propagated through the
hierarchical structure of the network obtained by the
algorithm of the previous figure.

Figure 3 shows a scheme of the updating manner 
and proximity between neurons. 

PREFERRED EMBODIMENT OF THE INVENTION 

[0012] To carry out the detailed description of the
preferred embodiment of the invention, continual
reference will be made to the drawings, in which figure 1
shows the flow chart of the self-organizing algorithm,
which begins (1) from a structure composed of two
"daughter" neurons connected to a "mother" neuron,
each of said neurons being a list of values or profile (2),
which is initially a collection of random numbers.
[0013] The next step consists of dividing the data set 
into data sub-sets in a series of successive cycles (3) in 
the following manner: 

[0014] Each of the cycles (3) consists of a series of 
stages (4) in which the data is introduced to the terminal 
neurons (2) and the latter are updated. 
[0015] A stage (4) consists of introducing the entire 
data set to the terminal neurons (2). In this way, as the 
stages elapse (4), the neurons become updated, being 
set to mean values obtained from the data set. 
[0016] Between two consecutive stages (4), the
convergence condition (5) of the cycle is checked, such that
if the updating produced between two stages (4) is
below a certain pre-established minimum level, it is
considered that the network has converged in that cycle.
[0017] Next, after the cycle convergence, the conver- 
gence condition (6) of the network is checked, such that 
it is decided if the network should continue to grow, add- 
ing (7) a neuron and again initializing the cycle (3), or if 
it has already stopped growing, in which case the algo- 
rithm process is finalized (8). 
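
The flow of figure 1 described above can be sketched in Python as follows. This is a deliberately simplified illustration and not the implementation of the invention: it tracks only the terminal neurons, updates only the winning neuron (neighborhood updating is omitted), uses a Euclidean distance, and assumes that growth (7) replaces the terminal neuron with the largest local resource by two daughters that copy its profile. That growth rule, all function names, and all parameter values (`eta`, `threshold`, `max_stages`, `max_neurons`) are assumptions made for the example.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two profiles (one possible distance function)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def winner_index(profile, neurons):
    """Index of the terminal neuron closest to the profile."""
    return min(range(len(neurons)), key=lambda i: distance(profile, neurons[i]))


def run_stage(profiles, neurons, eta):
    """Stage (4): present the entire data set once; return the mean winner distance."""
    total = 0.0
    for profile in profiles:
        w = winner_index(profile, neurons)
        neurons[w] = [c + eta * (x - c) for c, x in zip(neurons[w], profile)]
        total += distance(profile, neurons[w])
    return total / len(profiles)


def run_cycle(profiles, neurons, eta=0.1, threshold=1e-3, max_stages=500):
    """Cycle (3): repeat stages until the relative change between stages is small (5)."""
    previous = run_stage(profiles, neurons, eta)
    for _ in range(max_stages):
        current = run_stage(profiles, neurons, eta)
        if previous == 0.0 or abs((current - previous) / previous) < threshold:
            break
        previous = current


def local_resources(profiles, neurons):
    """Mean distance to each terminal neuron over the profiles it wins."""
    sums = [0.0] * len(neurons)
    counts = [0] * len(neurons)
    for profile in profiles:
        w = winner_index(profile, neurons)
        sums[w] += distance(profile, neurons[w])
        counts[w] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]


def grow(profiles, growth_threshold, max_neurons=8, seed=0):
    """Run cycles and add neurons until the network stops growing."""
    rng = random.Random(seed)
    width = len(profiles[0])
    # Start (1): two daughter neurons holding random profiles (2).
    neurons = [[rng.random() for _ in range(width)] for _ in range(2)]
    while True:
        run_cycle(profiles, neurons)
        resources = local_resources(profiles, neurons)
        worst = max(range(len(neurons)), key=lambda i: resources[i])
        # Network convergence (6): stop when every cluster is tight enough.
        if resources[worst] < growth_threshold or len(neurons) >= max_neurons:
            return neurons  # finalize (8)
        # Growth (7), assumed rule: the loosest neuron gives rise to two daughters.
        mother = neurons.pop(worst)
        neurons.extend([list(mother), list(mother)])


# Hypothetical data: two groups of genes with clearly different profiles.
data = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
terminal_neurons = grow(data, growth_threshold=0.5, max_neurons=4)
print(len(terminal_neurons), terminal_neurons)
```

The `max_neurons` cap is only a safeguard for the sketch, so that the loop always terminates even if the chosen threshold can never be reached.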

[0018] The neuron updating is performed by using the
relative network error. This is obtained from the
resource:

$$R = \frac{1}{N} \sum_{j=1}^{N} d\left(P_j, C_{w(j)}\right)$$

which is the mean of the distance values $d$ between each
of the $N$ profiles $P_j$ and its corresponding winning
neuron (2), $C_{w(j)}$. The error is defined as:

$$\varepsilon = \left| \frac{R_t - R_{t-1}}{R_{t-1}} \right| < \text{Threshold}$$

where $t$ indexes the stage (4).

[0019] In other words, the error is the relative increment
in the resource value. When a significant drop in the
resource value R is not achieved, it is considered that at
the level of this cycle (3) the network has converged.
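
The convergence test of a cycle can be illustrated with a short Python sketch. It reads the resource R as the mean distance between each profile and its winning (closest) terminal neuron, and the error as the absolute relative change in R between two consecutive stages. The Euclidean distance, the function names, the default threshold of 0.01 and the sample values are assumptions made for the example.

```python
import math


def distance(p, q):
    """Euclidean distance between two profiles (assumed distance function)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def resource(profiles, neurons):
    """Mean distance between each profile and its winning (closest) neuron."""
    return sum(min(distance(p, c) for c in neurons) for p in profiles) / len(profiles)


def has_converged(r_previous, r_current, threshold=0.01):
    """True when the relative change in the resource is below the threshold.

    r_previous must be non-zero.
    """
    return abs((r_current - r_previous) / r_previous) < threshold


# Hypothetical profiles and terminal neurons.
profiles = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
neurons = [[0.0, 1.0], [10.0, 10.0]]
r = resource(profiles, neurons)  # winner distances are 1, 1 and 0
print(r)
print(has_converged(1.0, 0.995))  # small relative change
print(has_converged(1.0, 0.5))    # large relative change
```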
[0020] As has been explained, a stage (4) consists of 
introducing the entire data set to the terminal neurons 
(2), each introduction consisting of two steps: first, find- 
ing the neuron most similar to the profile being present- 
ed (winning neuron) among the terminal neurons, in oth- 
er words, the neuron at the shortest distance to it, and 
next, the neuron and its neighborhood are updated, 
such as is shown in figure 3. 



[0021] To prevent asymmetrical updating, the
neighborhoods are defined as shown in figure 3. If it were not
done in this manner, the upper neuron in the example
of said figure would be updated by its daughter neuron
on the left but not by the one on the right, which already
has two descendants.

[0022] The updating process consists of these neurons
modifying their values by means of the following
formula (Kohonen, 1990):

$$C_i(\tau + 1) = C_i(\tau) + \eta \cdot \left(P_j - C_i(\tau)\right)$$

where $\eta$ is the magnitude factor of the updating of the
i-th neuron, dependent on its proximity to the "winning"
neuron within the neighborhood; $C_i(\tau)$ is the i-th
neuron at introduction number $\tau$; and $P_j$ is the j-th
expression profile.
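
The updating formula moves each neuron of the neighborhood a fraction of the way toward the profile just presented. A minimal Python sketch follows. The neighborhood is assumed here to consist of the winning neuron, its mother and its sister, each with its own magnitude factor; the numerical factors shown are hypothetical, chosen only so that they decrease with the distance from the winning neuron.

```python
def update_neuron(neuron, profile, eta):
    """Move a neuron a fraction eta of the way toward the presented profile."""
    return [c + eta * (p - c) for c, p in zip(neuron, profile)]


def update_neighborhood(winner, mother, sister, profile, etas=(0.01, 0.005, 0.001)):
    """Update the winner, its mother and its sister with decreasing factors.

    The three-member neighborhood and the default factors are assumptions
    made for this example.
    """
    eta_winner, eta_mother, eta_sister = etas
    return (
        update_neuron(winner, profile, eta_winner),
        update_neuron(mother, profile, eta_mother),
        update_neuron(sister, profile, eta_sister),
    )


# With eta = 0.5 the neuron moves halfway toward the profile.
print(update_neuron([1.0, 1.0], [0.0, 2.0], 0.5))

# Each member of the neighborhood moves by its own fraction.
print(update_neighborhood([0.0], [0.0], [0.0], [1.0], etas=(0.5, 0.25, 0.125)))
```

Because every update is a weighted step toward a presented profile, repeated presentations pull each neuron toward the mean of the profiles it responds to.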
[0023] In this manner, the neurons adapt their values
to a mean value representative of the profile cluster
associated with them. This value is of particular interest
because it represents the mean profile of the cluster. At
the same time, they propagate this likeness through the
network, as shown in figure 2, with decreasing values
of $\eta$.
[0024] Lastly, it is worth mentioning that the network
growth can be stopped by means of a threshold for the
resources that the user can set, or a criterion based on
the data can be defined in order to decide
in which cycle (3) the network stops growing. For the
latter, the original profiles are taken and random
permutations of the order of the expression values are
carried out on each of them. In this way, the correlation
existing between the profiles of the different genes is
destroyed. Then, the distance values for each profile pair
are calculated, and the distribution of the values
observed at random is obtained. From this distribution, a
distance value can be found that appears purely by chance
with a very low probability: for example, the value
randomly observed at the 99% to 99.9% percentile. If the
network growth stops when all the local resources are
below the set distance value, the resulting profile
clusters have a significantly higher likeness between
their members than would be expected at random.
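
The data-based stopping criterion can be sketched in Python as follows. The sketch permutes each profile independently, computes all pairwise distances between the permuted profiles, and picks a threshold from the resulting distribution. One reading of the 99% figure is assumed here, namely that it is a confidence level: the threshold is the distance that only a fraction of one minus the confidence of the randomized pairs fall below. That reading, the Euclidean distance, the function names and the sample data are assumptions made for the example.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two profiles (assumed distance function)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def random_distance_threshold(profiles, confidence=0.99, seed=0):
    """Distance that randomized profile pairs rarely fall below.

    Each profile is shuffled independently, which destroys the correlation
    between genes while keeping each gene's own set of values.
    Requires at least two profiles.
    """
    rng = random.Random(seed)
    shuffled = [rng.sample(p, len(p)) for p in profiles]
    distances = sorted(
        distance(shuffled[i], shuffled[j])
        for i in range(len(shuffled))
        for j in range(i + 1, len(shuffled))
    )
    # Position of the value that only (1 - confidence) of the pairs fall below.
    index = int((1.0 - confidence) * (len(distances) - 1))
    return distances[index]


# Hypothetical expression profiles (genes x conditions).
data = [
    [0.1, 0.9, 1.8, 2.7],
    [0.2, 1.1, 2.1, 3.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.8, 1.9, 0.9, 0.2],
    [1.0, 1.1, 0.9, 1.0],
]
threshold = random_distance_threshold(data, confidence=0.99, seed=1)
print(threshold)
```

With real data sets the number of profile pairs grows quadratically with the number of genes, so in practice the distribution might be estimated from a random sample of pairs rather than from all of them.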

[0025] It is not necessary to extend the content of this
description for an expert in the matter to
understand its scope and the advantages derived from the
invention, as well as to develop and put into practice
the object thereof.

[0026] However, it must be understood that the
invention has been described according to one preferred
embodiment thereof, due to which it is susceptible to
modifications without implying any alteration
whatsoever to its scope.



Claims

1. A self-organizing algorithm for the massive analysis
of gene expression data from DNA array experiments,
based on the development of a SOTA (Self
Organizing Tree Algorithm) neuronal network,
characterized in that it is developed in the following
manner:

- it begins (1) from a structure composed of two
"daughter" neurons connected to a "mother"
neuron, each of said neurons initially being a
list of random values or random profile,

- the data set is divided into subsets in a successive
series of cycles (3), each cycle (3) having
a series of stages (4) in which the data is
introduced to the terminal neurons (2) and the latter
are updated, being set to mean values obtained
from the data set,

- the convergence condition (5) of the cycle is
checked between two consecutive stages,
such that if the updating produced between two
stages (4) is below a certain pre-established
minimum level, it is considered that the network
has converged in that cycle,

- likewise, the convergence condition (6) of the
network is then checked, such that it is decided
if the network should continue to grow, adding
(7) a neuron and again initializing the cycle (3),
or if it has already stopped growing, in which
case the algorithm process is finalized (8).

2. A self-organizing algorithm according to claim 1,
characterized in that the neuron updating is carried
out by using the relative network error, also
called the relative increment in the resource value
R, which is the measurement of the distance values
between each profile and its corresponding winning
neuron, such that when a significant drop in the
resource value R is not achieved, it is considered that
the network has converged at the level of this cycle
(3).

3. A self-organizing algorithm according to claims 1
and 2, characterized in that the updating process
consists of the neurons modifying their values by
means of the Kohonen formula:

$$C_i(\tau + 1) = C_i(\tau) + \eta \cdot \left(P_j - C_i(\tau)\right)$$

where $\eta$ is the magnitude factor of the update of the
i-th neuron, dependent on its proximity to the "winning"
neuron within the neighborhood; $C_i(\tau)$ is the i-th
neuron in the presentation number $\tau$; and $P_j$ is the
j-th expression profile.

4. A self-organizing algorithm according to the previous
claims, characterized in that it is able to stop
the network growth either by means of a threshold
for the resources, possibly set by the user, or by
defining a criterion based on the data in order to decide
in which cycle (3) the network stops growing, taking
the original profiles and carrying out random
permutations on each of them in the order of their
expression values, thus destroying the correlation existing
between the profiles of the different genes, and
calculating the distance values for each profile pair in
order to represent the distribution of values randomly
observed, such that if the network growth stops
when all the local resources are below the set
distance value, profile clusters are obtained having a
significantly higher likeness between their members than
what would be expected randomly.